ESKD Risk Prediction Model in a Multicenter Chronic Kidney Disease Cohort in China: A Derivation, Validation, and Comparison Study

Background and objectives: In light of the growing burden of chronic kidney disease (CKD), it is of particular importance to create disease prediction models that can assist healthcare providers in identifying cases of CKD individual risk and integrate risk-based care for disease progress management. The objective of this study was to develop and validate a new pragmatic end-stage kidney disease (ESKD) risk prediction utilizing the Cox proportional hazards model (Cox) and machine learning (ML). Design, setting, participants, and measurements: The Chinese Cohort Study of Chronic Kidney Disease (C-STRIDE), a multicenter CKD cohort in China, was employed as the model’s training and testing datasets, with a split ratio of 7:3. A cohort from Peking University First Hospital (PKUFH cohort) served as the external validation dataset. The participants’ laboratory tests in those cohorts were conducted at PKUFH. We included individuals with CKD stages 1~4 at baseline. The incidence of kidney replacement therapy (KRT) was defined as the outcome. We constructed the Peking University-CKD (PKU-CKD) risk prediction model employing the Cox and ML methods, which include extreme gradient boosting (XGBoost) and survival support vector machine (SSVM). These models discriminate metrics by applying Harrell’s concordance index (Harrell’s C-index) and Uno’s concordance (Uno’s C). The calibration performance was measured by the Brier score and plots. Results: Of the 3216 C-STRIDE and 342 PKUFH participants, 411 (12.8%) and 25 (7.3%) experienced KRT with mean follow-up periods of 4.45 and 3.37 years, respectively. The features included in the PKU-CKD model were age, gender, estimated glomerular filtration rate (eGFR), urinary albumin–creatinine ratio (UACR), albumin, hemoglobin, medical history of type 2 diabetes mellitus (T2DM), and hypertension. In the test dataset, the values of the Cox model for Harrell’s C-index, Uno’s C-index, and Brier score were 0.834, 0.833, and 0.065, respectively. The XGBoost algorithm values for these metrics were 0.826, 0.825, and 0.066, respectively. The SSVM model yielded values of 0.748, 0.747, and 0.070, respectively, for the above parameters. The comparative analysis revealed no significant difference between XGBoost and Cox, in terms of Harrell’s C, Uno’s C, and the Brier score (p = 0.186, 0.213, and 0.41, respectively) in the test dataset. The SSVM model was significantly inferior to the previous two models (p < 0.001), in terms of discrimination and calibration. The validation dataset showed that XGBoost was superior to Cox, regarding Harrell’s C, Uno’s C, and the Brier score (p = 0.003, 0.027, and 0.032, respectively), while Cox and SSVM were almost identical concerning these three parameters (p = 0.102, 0.092, and 0.048, respectively). Conclusions: We developed and validated a new ESKD risk prediction model for patients with CKD, employing commonly measured indicators in clinical practice, and its overall performance was satisfactory. The conventional Cox regression and certain ML models exhibited equal accuracy in predicting the course of CKD.


Introduction
Chronic kidney disease (CKD) is a leading public health problem worldwide. The estimated prevalence of CKD is 9.1% [1] globally and 10.8% in China [2]. The disease burden has been increasing significantly because of the rise in diabetes, hypertension, and aging [2][3][4][5], and it also contributes to the increased burden of end-stage kidney disease (ESKD) requiring kidney replacement therapy, which incurs huge health costs. As a result of early asymptomatic stages of CKD and heterogeneous progression to ESKD, both healthcare providers and patients seek to predict disease prognosis for optimal risk-based individualized management. Tangri et al. [6,7] initially developed and validated the kidney failure risk equations (KFREs) using data from two nephrology referral centers in Canada and then data from the chronic kidney disease prognosis consortium (CKD-PC), which includes millions of patients with CKD stages 3-5. For populations of Asian ancestry, the model has been externally validated in populations from Korea [8] and Singapore [9], demonstrating satisfactory performance. However, participants with early-stage CKD were sub-optimally represented in prior studies. Most participants in previous studies did not have glomerular disease, which is still the common cause of CKD in developing countries. In addition, several models (traditional statistical or ML algorithms) for ESKD prediction exist, but they are limited, partly due to sample size [10], external validation, ML model interpretation, and clinical application [11][12][13]. None of them were developed for a population with CKD stages 1-4. ML appears to make fewer assumptions and may be more accurate in predictive performance [14,15]. However, the traditional Cox regression model may lose the opportunity to identify and involve the key clinical features of CKD in the prediction model, which may be complemented by ML algorithms.
Using a multicenter CKD research cohort of patients with CKD stages 1-4 under the care of nephrologists, we used Cox and ML to develop and validate a pragmatic risk prediction model for ESKD at two and five years, based on supervised routinely available features, and we additionally compared their prediction accuracies.

Ethics Approval and Declaration
This study adhered to the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) reporting guidelines.

Development Cohort
This development cohort was derived from C-STRIDE [16,17], the first nationwide CKD cohort in China, along with 39 clinical centers located in 28 cities from 22 provinces. All of these clinical centers are renal departments from different hospitals. Participants who met the following criteria were eligible for enrollment: (1) aged 18-74 years and (2) specified eGFR range, according to different CKD etiologies. For glomerulonephritis (GN) patients, the eGFR should be ≥15 mL/min/1.73 m 2 . For diabetic nephropathy (DN) patients, the defining eligibility was 15 mL/min/1.73 m 2 ≤ eGFR < 60 mL/min/1.73 m 2 or eGFR ≥ 60 mL/min/1.73 m 2 with "nephrotic range" proteinuria. For non-GN and non-DN patients, 15 mL/min/1.73 m 2 ≤ eGFR < 60 mL/min/1.73 m 2 was set for enrollment. Patients aged >70 years or without baseline eGFR and demographic values or with a follow-up time of <3 months were excluded ( Figure 1). Their laboratory test data were collected in central laboratories at baseline and, along with demographics and anthropometrics, were annually collected and evaluated throughout the study. The outcomes were defined as kidney replacement therapy (KRT; maintenance dialysis or renal replacement) at three-month intervals, as ascertained by each clinical center. We censored patients without KRT events throughout the limited follow-up period due to death, dropout, or 31 December 2017, whichever came first. aged >70 years or without baseline eGFR and demographic values or with a follow-up time of <3 months were excluded ( Figure 1). Their laboratory test data were collected in central laboratories at baseline and, along with demographics and anthropometrics, were annually collected and evaluated throughout the study. The outcomes were defined as kidney replacement therapy (KRT; maintenance dialysis or renal replacement) at threemonth intervals, as ascertained by each clinical center. We censored patients without KRT events throughout the limited follow-up period due to death, dropout, or 31 December 2017, whichever came first.

Validation Cohort
The prospective CKD cohort at PKUFH, which included CKD G1-G4 with various etiologies enrolled, was used as the external validation cohort. This cohort met the exclusion criteria of the development dataset and was, therefore, designated the validation dataset. The KRT outcome was also documented during the period of follow-up. Since the maximum follow-up length in the development set was six years, data from the patients in the validation set with a follow-up longer than six years were censored.

Candidate Variables
The baseline visit included age, gender, resting blood pressure, and comorbidities history, including type 2 diabetes mellitus (T2DM), hypertension, and cardiovascular disease, obtained at each center. Laboratory tests were collected for each patient (hemoglobin, creatine, albumin, bicarbonate, fasting blood glucose, uric acid, blood lipids panel, and serum electrolytes). Among these items, laboratory tests, 24-h urine electrolytes, and urine ACR were measured in the central laboratory (PKUFH) to avoid variation of the

Validation Cohort
The prospective CKD cohort at PKUFH, which included CKD G1-G4 with various etiologies enrolled, was used as the external validation cohort. This cohort met the exclusion criteria of the development dataset and was, therefore, designated the validation dataset. The KRT outcome was also documented during the period of follow-up. Since the maximum follow-up length in the development set was six years, data from the patients in the validation set with a follow-up longer than six years were censored.

Candidate Variables
The baseline visit included age, gender, resting blood pressure, and comorbidities history, including type 2 diabetes mellitus (T2DM), hypertension, and cardiovascular disease, obtained at each center. Laboratory tests were collected for each patient (hemoglobin, creatine, albumin, bicarbonate, fasting blood glucose, uric acid, blood lipids panel, and serum electrolytes). Among these items, laboratory tests, 24-h urine electrolytes, and urine ACR were measured in the central laboratory (PKUFH) to avoid variation of the testing values between laboratories. eGFR was evaluated by the CKD-EPI [18] creatinine equation. Candidate variables with missing values greater than 30% in the development dataset were excluded.

Data Preprocessing and Statistical Analysis
In this study, the C-STRIDE cohort was randomly divided into training and test datasets at a ratio of 7:3. The PKUFH CKD cohort was considered an independent external validation set. The training dataset was used for training survival models, while the internal test dataset and the external validation dataset were used for model evaluation.
Any feature with a missing rate greater than 30% in the development dataset was excluded. We imputed the remaining missing data using the multivariate data by chained equations (MICE) method, which can handle complex incomplete data. Moreover, to avoid the problem of information leakage, we conducted MICE imputation for the training dataset first, and then, for the test and external validation datasets, sequentially by means of the imputed training dataset. In addition, the urine ACR feature values were logarithmically transformed with a base of 10, due to a skewed distribution.
As applicable, all baseline characteristics are presented as means with standard deviations, medians with interquartile ranges (IQRs), and frequencies with percentages. An event per variable (EPV) value > 10 was used to calculate the sample size. On the basis of the candidate variables and the number of ESKD occurrences, the sample size of the development cohort was adequate. The survival station of two datasets was described using Kaplan-Meier curves. In any hypothesis test, a type I error was set at the 0.05 level.

Model-Driven Feature Selection
During the building of a predictive model, feature selection is integral. We searched for the predictive feature of ESKD reported in previous studies [19][20][21]. In addition, we utilized the training set to identify potential features in conjunction with the Cox model stepwise selection (both directions, Akaike information criterion), as well as the least absolute shrinkage and selection operator (Cox-LASSO) approaches, to find the optimal model via cross-validation. Partial likelihood deviance within the acceptable ranges was determined the lambda. Expert knowledge and clinical cost-effectiveness were integrated to determine the final features.

Model Training
During model development, we investigated the Cox regression model and two ML models, namely XGBoost [22] and SSVM [23]. XGBoost is a gradient-boosting ensemble model consisting of a group of decision trees. It can deal with survival prediction tasks by means of learning each tree using the survival loss function setting. As another survival machine learning model, SSVM is an extension of the conventional SVM algorithm and has also been applied in biomedical studies [24]. For each ML model, we first tuned the hyperparameters by means of the Bayesian optimization algorithm, aiming at maximizing the average of the C-index values of fivefold cross-validation. Then, we fit the models with the optimal hyper-parameters to the entire training dataset again to acquire the final model and then applied it to the inner test and external validation datasets for prediction. Given the feature values of a patient, the corresponding prediction results included a prognostic index (PI) and an individual survival curve, which were further used for model evaluation.

Model Evaluation and Comparison
We considered a series of metrics to measure model discrimination. The first and most widely used measurement was Harrell's concordance index (Harrell's C-index). Uno's concordance (Uno's C) was also included, as it is preferable for high censoring cases [25]. In addition to the global concordances, we also considered time-dependent AUC (TD-AUC) for discriminability at specific time points. In this study, we mainly focused on ESKD at two and five years.
In addition, the overall performance (or calibration) was carried out using the Brier score, which calculated the squared differences between the actual outcomes and the pre-dictions. The Brier score ranges from 0 (prediction and results are identical) to 1 (discordant prediction). Accordingly, the calibration curves visualized the difference between the predicted and observed survival probability.
The statistical inference and hypothesis tests for Harrell's and Uno's C-index, TD-AUC, and the Brier score were based on non-parametric bootstrap resampling techniques, such as variance, 95% confidence intervals (CIs), and difference tests between models. Briefly speaking, the samples in each dataset were randomly resampled 1000 times to generate a group of bootstrap values for the statistical estimation of the metrics. Then, the CIs, z-statistics, and the corresponding p-values in the hypothesis testing were derived based on these bootstrap metric values [26].

Implementation Setup
This study was conducted using Python (version 3.

Cohort Description
The development and external validation cohorts eventually comprised 3216 (2251 patients in the training set and 965 patients in the test set) and 342 patients, respectively (Table 1 and Figure 1). The observed incidence of the outcome ranged from 411 (12.8%) events in the C-STRIDE cohort to 25 (7.3%) events in the validation cohort. The mean time to event was 4.5 and 3.4 years in the respective cohorts. The Kaplan-Meier survival curve is shown in Figure S1. The mean age of the study population was 48 and 55 years. The mean eGFR was 52.97 and 50.83 mL/min/1.73 m 2 , and more than 70% of the patients were in CKD stages 1-3. Glomerulonephritis was the primary etiology of CKD in C-STRIDE. The baseline urinary ACR median (25th-75th percentile) was 376.40 (90.80, 911.45) mg/g and 214.37 (43.55, 1058.50) mg/g. The baseline survival rate of the training sets was 0.9969 and 0.9827 at two and five years. The other baseline characteristics of the CKD patients are presented in Table 1.

Feature Selection
By employing multivariable stepwise Cox analysis and Cox-LASSO regression, the characteristics that predicted ESKD were reduced further (Tables S1 and S2 and Figure S2). Age, gender, eGFR, urine ACR, albumin, hemoglobin, medical history of T2DM, and hypertension were included in the final model.

Model Performance
The findings of the discrimination and calibration (Brier score) are displayed in Table 2. In the Cox model, Harrell's C-indices were 0.841, 0.834, and 0.761, while the Uno's C-indices were 0.807, 0.833, and 0.796 in the training, testing, and validation datasets, respectively. The XGBoost model provided similar findings for Harrell's C-index, but when applied to all datasets, Uno's C-index values of 0.836, 0.825, and 0.822 were more consistent in three datasets. The XGBoost method performed better than Cox, in terms of the two-and fiveyear TD-AUC in the validation. Compared to the preceding two models, the discrimination of the SSVM model was less satisfactory, given the Harrell's C-index values of 0.748 and 0.745 in testing and validation. The Cox model and XGBoost algorithm performed similarly, in terms of the Brier scores (0.065 and 0.066), while that of the SSVM was 0.070.
The calibration curves exhibited suboptimal performance at certain risk thresholds. ( Figure S3). The curves of the Cox model and SSVM revealed underestimation and overestimation, respectively, in the testing set for the high-risk population at 2 and 5 years. The calibration of XGBoost was generally centered on the 45-degree line, whereas high-risk groups were overestimated.

Model Comparison
A comparative analysis of the performance of the Cox, XGBoost, and SSVM models demonstrated that XGBoost performed significantly better than Cox and SSVM in the training set for both discrimination (Harrell's C and Uno's C) and calibration (p < 0.001). When used in the testing set, XGBoost performed similarly to Cox, in terms of Harrell's C, Uno's C, and the Brier score (p = 0.186, 0.213, and 0.141, respectively), but was statistically better than SSVM (p < 0.001). The validation set showed that XGBoost was superior to Cox, regarding Harrell's C, Uno's C, and the Brier scores (p = 0.003, 0.027, and 0.032, respectively), while Cox and SSVM were almost identical concerning these three parameters.

Web Application
The final PKU-CKD prognostic model was displayed via a clinical decision support system (CDSS) embedded in the hospital EHR system for further regional and prospective evaluation ( Figure S4). The model's absolute risk of KRT, shown in the interface, was calculated based on Cox algorithms. The XGBoost model results were calculated at the back-end to further evaluate the prediction accuracy. We established age, gender, eGFR, and UACR as the initial conditions in the actual operation of the model. This makes it particularly applicable for use by clinicians who make decisions in the absence of additional data. For better visualization and user experience, the output score for each patient was normalized between 0 and 100, and we presented the two-and five-year ESKD risk values, rather than the survival probability. We also introduced the impact of feature values, according to the local interpretable model-agnostic explanations (LIME) algorithm [27], which could enhance the model explanation in an application. The original equation displayed the regression coefficient of the features and baseline hazard at two and five years in the Cox model (Table S3).

Discussion
In this study, using a multicenter cohort with a CKD stage of 1-4, mainly consisting of glomerulonephritis (57.6%), we successfully developed and externally validated a CKD model to predict the absolute risk of KRT in two-and five-year periods. In the training, testing, and validation datasets, the Cox model yielded C-index values of 0.807, 0.833, and 0.796, while XGBoost produced almost identical results with C-index values of 0.836, 0.825, and 0.822. Therefore, this risk model may aid in individualized patient management in the Chinese population. Although the Cox and XGBoost algorithms were basically equivalent in the test populations, the latter was superior in its external validation.
KFREs are the most frequently used ESKD risk prediction models in the world. They were initially developed in a Canadian population and showed high discrimination and adequate calibration, validated in 31 multinational cohorts (overall C statistic, 0.90; 95% CI, 0.89-0.92 at two years; C statistic at five years, 0.88; 95% CI, 0.86-0.90) [6]. However, most of these populations were patients with non-glomerular diseases. Cardiovascular disease or death was more common than kidney failure events in most of these cohorts. In the Chinese and Asian pacific areas, glomerular renal disease is still the leading cause of ESKD, rather than diabetes-based kidney disease as in Western countries. In the C-STRIDE cohort, where nearly 60% of the participants had glomerular disease, we validated the KFREs and found that the performances, expressed as the C statistic of the eight-variable equation, at two and five years were 0.79 (95% CI, 0.80-0.77) and 0.75 (95% CI, 0.76-0.74), respectively (Table S4). Similarly, in a prior glomerulonephritis cohort study, the C statistic in the validation cohort was 0.72 (95% CI, 0.67-0.78) [28]. This suggests that the addition of a calibration factor or remodeling may be necessary.
The C-STRIDE study recruited participants mainly from 39 kidney disease research centers across China. Within this cohort, we developed a new equation of kidney risk prediction with better performance than the KFREs [29]. Our study was developed and validated using both the Cox and ML (SSVM and XGBoost) algorithms. We realized that the ability of the Cox and XGBoost algorithms to differentiate between patients with and without ESKD was robust, whereas the XGBoost model was slightly higher than the Cox model, in terms of validation metrics. Conversely, the Cox model's overall ability is superior to that of SSVM; this implies that the Cox model might be comparable to or even better than some ML models for time-to-event data. However, some studies have unleashed the predictive power of machine learning far beyond traditional statistics and clinical experts, such as a study assessing dry weight in pediatric patients on chronic hemodialysis [30] and predicting the risk of incident cardiovascular [31]. The ability and feasibility of ML in predictive models are still questioned, due to the inherent overfitting [11] and "black box" characteristics in most studies [32][33][34][35]. In other words, it is not possible to understand precisely how a computation approximates a particular function. Further, higher placement in machine learning does not imply superiority. Instead, ML is deeply related to traditional statistical models, which are recognizable to most clinicians and require a combination of clinician-supervised and data-driven processes. Existing CKD prognostic models mostly include well-known risk factors for disease progression to ESKD, which is inseparable from the efforts of physicians and statistics.
The strength of this study is that a CKD prognostic model was developed using a large national multicenter cohort, employing both the Cox and XGBoost methods. The predictors of our model were regularly examined in the majority of centers, making the model practical and well-suited to routine clinical practice. However, this study has several limitations. First, our model was based on a Chinese cohort, and it still needs to be confirmed in other populations. Second, this cohort recruited patients from a kidney center, and more than half of them had glomerular disease. The mean age of this cohort was younger than that of the KFRE cohort, of which the participants with kidney disease developed this condition mainly due to diabetes or hypertension. Thus, this model may be more appropriate for tertiary specialty hospitals in developing countries than for general care. Third, this model could not account for treatment-related variables, due to a lack of data on medicines that might affect CKD prognosis.

Conclusions
In conclusion, we developed and validated the PKU-CKD prognostic risk model commonly measured in clinical practice, and the overall performance was well-discriminated and calibrated. Conventional regression and ML models may show comparable robustness in predicting the progression of CKD.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/jcm12041504/s1, Figure S1: The Kaplan-Meier survival curves in training, testing, and validation dataset. Figure S2: a. This Cox-LASSO figure shows that as the penalty coefficient lambda increases, the regression coefficient gradually compresses to 0 to reduce the features. b. According to this Cox-LASSO cross-validation figure, as lambda increases, partial likelihood deviance gradually increases. Figure S3: Visual plots of calibration for Coxph (2a.), SSVM(2b.), and XGBoost(2c.), 2 and 5-years in training, testing, and validation set. Figure S4: The PKU-CKD prognostic model clinical decision support system (Absolute risk predictions for KRT at 2and 5-year for a scenario for a 64-year-old man with a history of diabetes and hypertension, an eGFR 50 mL/min/1.73 m 2 and a urine albumin-to-creatinine ratio of 150 mg/g, an albumin 23 g/L, and hemoglobin 100 g/L). Table S1: Multivariable Cox analysis using both directions step in the training dataset. Table S2: Feature selection in Cox-LASSO cross-validation with 1 standard error lambda. Table S3: The equation to apply 2 or 5 years of the selected features to an Individual Patient. Table S4: Validation of 8-variable KFREs performance using CKD G3~G5 patients in 2 and 5 years.  Institutional Review Board Statement: The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of Peking University First Hospital(protocol code 2011(363)).